Note: This is a sample solution for the project. Projects will NOT be graded on the basis of how well the submission matches this sample solution. Projects will be graded on the basis of the rubric only.¶

Problem Statement¶


Business Context¶

Understanding customer personality and behavior is pivotal for businesses to enhance customer satisfaction and increase revenue. Segmentation based on a customer's personality, demographics, and purchasing behavior allows companies to create tailored marketing campaigns, improve customer retention, and optimize product offerings.

A leading retail company with a rapidly growing customer base seeks to gain deeper insights into their customers' profiles. The company recognizes that understanding customer personalities, lifestyles, and purchasing habits can unlock significant opportunities for personalizing marketing strategies and creating loyalty programs. These insights can help address critical business challenges, such as improving the effectiveness of marketing campaigns, identifying high-value customer groups, and fostering long-term relationships with customers.

With the competition intensifying in the retail space, moving away from generic strategies to more targeted and personalized approaches is essential for sustaining a competitive edge.


Objective¶

In an effort to optimize marketing efficiency and enhance customer experience, the company has embarked on a mission to identify distinct customer segments. By understanding the characteristics, preferences, and behaviors of each group, the company aims to:

  1. Develop personalized marketing campaigns to increase conversion rates.
  2. Create effective retention strategies for high-value customers.
  3. Optimize resource allocation, such as inventory management, pricing strategies, and store layouts.

As a data scientist tasked with this project, your responsibility is to analyze the given customer data, apply machine learning techniques to segment the customer base, and provide actionable insights into the characteristics of each segment.


Data Dictionary¶

The dataset includes historical data on customer demographics, personality traits, and purchasing behaviors. Key attributes are:

  1. Customer Information

    • ID: Unique identifier for each customer.
    • Year_Birth: Customer's year of birth.
    • Education: Education level of the customer.
    • Marital_Status: Marital status of the customer.
    • Income: Yearly household income (in dollars).
    • Kidhome: Number of children in the household.
    • Teenhome: Number of teenagers in the household.
    • Dt_Customer: Date when the customer enrolled with the company.
    • Recency: Number of days since the customer’s last purchase.
    • Complain: Whether the customer complained in the last 2 years (1 for yes, 0 for no).
  2. Spending Information (Last 2 Years)

    • MntWines: Amount spent on wine.
    • MntFruits: Amount spent on fruits.
    • MntMeatProducts: Amount spent on meat.
    • MntFishProducts: Amount spent on fish.
    • MntSweetProducts: Amount spent on sweets.
    • MntGoldProds: Amount spent on gold products.
  3. Purchase and Campaign Interaction

    • NumDealsPurchases: Number of purchases made using a discount.
    • AcceptedCmp1: Response to the 1st campaign (1 for yes, 0 for no).
    • AcceptedCmp2: Response to the 2nd campaign (1 for yes, 0 for no).
    • AcceptedCmp3: Response to the 3rd campaign (1 for yes, 0 for no).
    • AcceptedCmp4: Response to the 4th campaign (1 for yes, 0 for no).
    • AcceptedCmp5: Response to the 5th campaign (1 for yes, 0 for no).
    • Response: Response to the last campaign (1 for yes, 0 for no).
  4. Shopping Behavior

    • NumWebPurchases: Number of purchases made through the company’s website.
    • NumCatalogPurchases: Number of purchases made using catalogs.
    • NumStorePurchases: Number of purchases made directly in stores.
    • NumWebVisitsMonth: Number of visits to the company’s website in the last month.

Let's start coding!¶

Importing necessary libraries¶

In [1]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the data¶

In [2]:
# Mounting Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
# loading data into a pandas dataframe
data = pd.read_csv("/content/drive/MyDrive/marketing_campaign.csv", sep="\t")

Data Overview¶

Question 1: What are the data types of all the columns?¶

In [8]:
data.info()
data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
Out[8]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 88 546 172 88 88 3 8 10 4 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 1 6 2 1 6 2 1 1 2 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 49 127 111 21 42 1 8 2 10 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 4 20 10 3 5 2 2 0 4 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 43 118 46 27 15 5 5 3 6 5 0 0 0 0 0 0 3 11 0

Observations:

  • There are 29 columns in total, of which only 3 are of object type.
  • The dataset contains 2240 entries (rows).
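Note that Dt_Customer is stored as an object; for any tenure- or recency-style feature it may help to parse it as a datetime. A minimal sketch on a toy frame (the column name and day-month-year format match the dataset; the values are synthetic):

```python
import pandas as pd

# Toy frame mimicking the Dt_Customer format (day-month-year)
df = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})

# Parse with an explicit format so day and month are not swapped
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# Customer tenure in days, relative to the latest enrollment date
df["Tenure_Days"] = (df["Dt_Customer"].max() - df["Dt_Customer"]).dt.days
print(df["Tenure_Days"].tolist())  # → [550, 0, 199]
```

The explicit `format` argument matters here: without it, pandas may interpret `08-03-2014` as August 3rd rather than March 8th.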

Question 2: Check the statistical summary of the data. What is the average household income?¶

In [10]:
# the statistical summary of the data
data.describe(include='all').T
Out[10]:
count unique top freq mean std min 25% 50% 75% max
ID 2240.0 NaN NaN NaN 5592.159821 3246.662198 0.0 2828.25 5458.5 8427.75 11191.0
Year_Birth 2240.0 NaN NaN NaN 1968.805804 11.984069 1893.0 1959.0 1970.0 1977.0 1996.0
Education 2240 5 Graduation 1127 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 2240 8 Married 864 NaN NaN NaN NaN NaN NaN NaN
Income 2216.0 NaN NaN NaN 52247.251354 25173.076661 1730.0 35303.0 51381.5 68522.0 666666.0
Kidhome 2240.0 NaN NaN NaN 0.444196 0.538398 0.0 0.0 0.0 1.0 2.0
Teenhome 2240.0 NaN NaN NaN 0.50625 0.544538 0.0 0.0 0.0 1.0 2.0
Dt_Customer 2240 663 31-08-2012 12 NaN NaN NaN NaN NaN NaN NaN
Recency 2240.0 NaN NaN NaN 49.109375 28.962453 0.0 24.0 49.0 74.0 99.0
MntWines 2240.0 NaN NaN NaN 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 NaN NaN NaN 26.302232 39.773434 0.0 1.0 8.0 33.0 199.0
MntMeatProducts 2240.0 NaN NaN NaN 166.95 225.715373 0.0 16.0 67.0 232.0 1725.0
MntFishProducts 2240.0 NaN NaN NaN 37.525446 54.628979 0.0 3.0 12.0 50.0 259.0
MntSweetProducts 2240.0 NaN NaN NaN 27.062946 41.280498 0.0 1.0 8.0 33.0 263.0
MntGoldProds 2240.0 NaN NaN NaN 44.021875 52.167439 0.0 9.0 24.0 56.0 362.0
NumDealsPurchases 2240.0 NaN NaN NaN 2.325 1.932238 0.0 1.0 2.0 3.0 15.0
NumWebPurchases 2240.0 NaN NaN NaN 4.084821 2.778714 0.0 2.0 4.0 6.0 27.0
NumCatalogPurchases 2240.0 NaN NaN NaN 2.662054 2.923101 0.0 0.0 2.0 4.0 28.0
NumStorePurchases 2240.0 NaN NaN NaN 5.790179 3.250958 0.0 3.0 5.0 8.0 13.0
NumWebVisitsMonth 2240.0 NaN NaN NaN 5.316518 2.426645 0.0 3.0 6.0 7.0 20.0
AcceptedCmp3 2240.0 NaN NaN NaN 0.072768 0.259813 0.0 0.0 0.0 0.0 1.0
AcceptedCmp4 2240.0 NaN NaN NaN 0.074554 0.262728 0.0 0.0 0.0 0.0 1.0
AcceptedCmp5 2240.0 NaN NaN NaN 0.072768 0.259813 0.0 0.0 0.0 0.0 1.0
AcceptedCmp1 2240.0 NaN NaN NaN 0.064286 0.245316 0.0 0.0 0.0 0.0 1.0
AcceptedCmp2 2240.0 NaN NaN NaN 0.013393 0.114976 0.0 0.0 0.0 0.0 1.0
Complain 2240.0 NaN NaN NaN 0.009375 0.096391 0.0 0.0 0.0 0.0 1.0
Z_CostContact 2240.0 NaN NaN NaN 3.0 0.0 3.0 3.0 3.0 3.0 3.0
Z_Revenue 2240.0 NaN NaN NaN 11.0 0.0 11.0 11.0 11.0 11.0 11.0
Response 2240.0 NaN NaN NaN 0.149107 0.356274 0.0 0.0 0.0 0.0 1.0
Observations:¶

  • The average household income is approximately 52,247.25 dollars.

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method¶

In [4]:
# Check for missing values
print(data.isnull().sum())

# Treat missing values in 'Income' column using mean imputation
data['Income'] = data['Income'].fillna(data['Income'].mean())

# Verify if missing values are handled
print(data.isnull().sum())
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64
Observations:¶
  • The 24 missing values in Income have been imputed with the mean of the column.
  • Since we have not formed clusters yet, we can only impute the overall mean for all the missing values; a cluster-wise imputation could be revisited after segmentation.
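Since Income is right-skewed (a maximum of 666,666 against a median near 51,381), the mean is pulled upward by outliers; median imputation is a common, more robust alternative worth considering. A minimal sketch on synthetic values (not the project data):

```python
import numpy as np
import pandas as pd

# Synthetic income-like values with one extreme outlier and one missing entry
s = pd.Series([35000.0, 51000.0, 52000.0, 68000.0, 666666.0, np.nan])

mean_filled = s.fillna(s.mean())      # pulled upward by the outlier
median_filled = s.fillna(s.median())  # robust to the outlier

print(mean_filled.iloc[-1], median_filled.iloc[-1])  # → 174533.2 52000.0
```

The imputed value differs by more than a factor of three here, which is why the choice of statistic matters for skewed columns.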

Question 4: Are there any duplicates in the data?¶

In [13]:
# Check for duplicates
duplicates = data.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")
Number of duplicate rows: 0
Observations:¶
  • There are no duplicated rows in the data.

Exploratory Data Analysis¶

Univariate Analysis¶

Question 5: Explore all the variables and provide observations on their distributions. (histograms and boxplots)¶

In [18]:
# Loop through each numerical column in the DataFrame
for col in data.select_dtypes(include=np.number):
    plt.figure(figsize=(12, 4))  # Adjust figure size as needed

    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data[col], kde=True)  # Include KDE for better visualization
    plt.title(f'Histogram of {col}')

    # Boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(y=data[col])
    plt.title(f'Boxplot of {col}')

    plt.show()
[Histograms and boxplots for each numerical column]
Observations:¶
  • All customers were born before the year 2000, and a few birth years fall in the late 19th century, which is most likely a data entry error.
  • The Income distribution is fairly symmetric, suggesting it is close to a normal distribution. A few outliers indicate that some customers have very high incomes.
  • Customers have at most 2 children and 2 teenagers at home, and the majority have no kids.
  • The distribution of Recency is close to uniform.
  • The distributions of the amounts spent on food (wine, fish, sweets, meat, and fruits) and on gold are similar: a large block of customers spending little or nothing, followed by a slow decrease out to the outliers.
  • 50% of customers have made fewer than 2 purchases using a discount.
  • The distributions of the number of purchases made through the company's website and through catalogs are right-skewed, with outliers around 25.
  • 13 is the maximum number of times a customer has bought directly in store.
  • The distribution of the number of monthly visits to the company's website is left-skewed, and 50% of customers visit fewer than 6 times per month.
  • Except for the second campaign, the number of customers responding to each campaign is quite similar; the second campaign drew noticeably fewer responses.
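The skewness claims above can be quantified directly with pandas' `.skew()`, where positive values indicate a right tail. A minimal sketch on synthetic series (not the project data):

```python
import pandas as pd

# Toy series: one shaped like the spending columns (long right tail), one symmetric
right_skewed = pd.Series([0, 0, 1, 2, 3, 5, 8, 20, 50, 200])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(right_skewed.skew() > 1)       # → True (strong right skew)
print(abs(symmetric.skew()) < 1e-9)  # → True (no skew)
```

Running `data.select_dtypes(include="number").skew()` on the project data would give one such number per column.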
In [19]:
# univariate analysis categorical count

# Univariate analysis for categorical features
for col in data.select_dtypes(include=['object']):
    plt.figure(figsize=(8, 6))
    sns.countplot(x=col, data=data)
    plt.title(f'Countplot of {col}')
    plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
    plt.show()
    print("-" * 30)
[Countplots of each categorical column]

Observations :

  • The majority of customers hold a Graduation-level education and are married.
  • The ID column carries no useful information for the analysis, so there is no need to examine it.

Bivariate Analysis¶

Question 6: Perform multivariate analysis to explore the relationships between the variables.¶

In [5]:
# Create a heatmap of the correlation matrix
plt.figure(figsize=(30, 20))
num_data=data.select_dtypes(include=np.number)
sns.heatmap(num_data.corr(), annot=True, cmap='Spectral')
plt.title('Correlation Matrix Heatmap')
plt.show()
[Correlation matrix heatmap]
  • Year of birth is positively correlated with the number of kids at home and negatively correlated with the number of teenagers: the reason is evident, as older customers are more likely to have teenagers, or children who have already left home.
  • A surprising fact is that the wealthier the customer, the fewer kids they have at home.
  • Income is positively correlated with the variables representing the amounts spent on food and gold.
  • An interesting fact is that while Kidhome is negatively correlated with all the food and gold spending variables, Teenhome is uncorrelated with the wine and gold amounts.
  • NumDealsPurchases is positively correlated with Teenhome, which suggests that the more teens customers have at home, the more they use discounts. This variable is also positively correlated with the number of visits to the company's website.
  • It would take too long to analyze every pairwise correlation, so we focus only on the most important ones.
  • The campaign-response variables are correlated with each other: a customer who responds to one campaign is likely to respond to the others.
In [8]:
# Bivariate analysis for 'Total Spent' (create a new 'TotalSpent' column)
data['TotalSpent'] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']

# Plot TotalSpent against other relevant variables
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Income', y='TotalSpent', data=data)
plt.title('Total Spent vs. Income')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='Education', y='TotalSpent', data=data)
plt.title('Total Spent vs. Education')
plt.show()

plt.figure(figsize=(10, 6))
sns.boxplot(x='Education', y='Income', data=data)
plt.title('Income vs. Education')
plt.show()


plt.figure(figsize=(10,6))
sns.boxplot(x='Marital_Status', y='TotalSpent', data=data)
plt.title('Total Spent vs. Marital Status')
plt.show()

# Pairplot for selected numerical features
selected_features = ['Income', 'Recency', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'TotalSpent']
sns.pairplot(data[selected_features])
plt.show()
[Scatter plot of TotalSpent vs. Income; boxplots of TotalSpent and Income by Education and of TotalSpent by Marital_Status; pairplot of the selected numerical features]
  • Among all the variables in the data, Income and total spending stand out as the most important for the company.
  • As seen before, Income and TotalSpent are positively correlated.
  • Education level is a strong determinant of spending: customers with a Graduation-level education or higher are the ones who spend more, which may be due to their higher incomes.
  • The Income vs. Education plot confirms that this more educated group earns more than customers with a Basic education.
  • Customers who live alone spend less than the others.
  • There appears to be no correlation between the number of visits to the company's website and the number of purchases made through it.
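Rather than scanning the full heatmap by eye, the strongest pairwise correlations can be listed programmatically. A minimal sketch on synthetic columns (the names mirror the dataset, but the values are generated to make the expected ranking obvious):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Synthetic stand-ins: TotalSpent is constructed to track Income; Recency is noise
df = pd.DataFrame({
    "Income": x,
    "TotalSpent": 2 * x + rng.normal(scale=0.1, size=200),
    "Recency": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep the upper triangle only, so each pair appears once, then rank by strength
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.idxmax())  # → ('Income', 'TotalSpent')
```

Applied to `data.select_dtypes(include="number")`, `pairs.sort_values(ascending=False).head(10)` would surface the same relationships the heatmap shows.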

K-means Clustering¶

Question 7: Select the appropriate number of clusters using the elbow plot. What do you think is the appropriate number of clusters?¶

In [7]:
# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(num_data)

# Determine the optimal number of clusters using the Elbow method
score = []
for i in range(1, 14):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(scaled_data)
    score.append(kmeans.inertia_)

plt.plot(range(1, 14), score)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('score')
plt.show()
[Elbow plot of inertia vs. number of clusters]
Observations:¶
  • The elbow plot suggests 2 clusters. Let's check with the silhouette method whether this number of clusters is appropriate.
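The "elbow" logic is easiest to see on data with a known structure; a minimal sketch on synthetic blobs (not the project data), using the same inertia loop as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 clearly separated groups (illustration only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [4, 8]],
                  cluster_std=0.5, random_state=42)

inertias = []
for k in range(1, 8):
    inertias.append(KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_)

# Inertia always decreases with k; the elbow is where the curve flattens.
# Here the drop from k=3 to k=4 is tiny compared with the drop from k=2 to k=3.
print([round(i, 1) for i in inertias])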

Question 8: Finalize the appropriate number of clusters by checking the silhouette score as well. Is the answer different from the elbow plot?¶

In [43]:
# Silhouette Analysis
visualizer = SilhouetteVisualizer(KMeans(2))
visualizer.fit(scaled_data)
visualizer.poof()
[Silhouette plot for k = 2]
Out[43]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 2240 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [41]:
visualizer = SilhouetteVisualizer(KMeans(3))
visualizer.fit(scaled_data)
visualizer.poof()
[Silhouette plot for k = 3]
Out[41]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 2240 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Observations:¶
  • The silhouette method confirms that the optimal number of clusters is 2, since it has the highest average silhouette score (about 0.25).
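The "pick the k with the highest average silhouette" rule can be sketched end to end on synthetic blobs with a known group count (not the project data), using `silhouette_score` directly instead of the yellowbrick visualizer:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known 3-group structure
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [4, 8]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 3: the score peaks at the true number of groups
```

On the project data the peak is much lower (≈0.25) than on these clean blobs, which hints that the customer segments overlap considerably.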

Question 9: Do a final fit with the appropriate number of clusters. How much total time does it take for the model to fit the data?¶

In [20]:
import time

start_time = time.time()
kmeans = KMeans(n_clusters=2, random_state=42) # Use the appropriate number of clusters
kmeans.fit(scaled_data)
end_time = time.time()
time_2clusters_km = end_time - start_time

print(f"Total fitting time: {time_2clusters_km:.4f} seconds")
Total fitting time: 0.0254 seconds
Observations:¶

  • Fitting the K-Means model took 0.0254 seconds. We will compare this time with another algorithm later.

Hierarchical Clustering¶

Question 10: Calculate the cophenetic correlation for every combination of distance metrics and linkage. Which combination has the highest cophenetic correlation?¶

In [22]:
# Calculate the cophenetic correlation for every combination of distance metrics and linkage methods
distance_metrics = ['euclidean', 'minkowski', 'chebyshev']
linkage_methods = ['complete', 'average', 'single']

results = []

for metric in distance_metrics:
    for method in linkage_methods:
        Z = linkage(scaled_data, method=method, metric=metric)
        c, coph_dists = cophenet(Z, pdist(scaled_data, metric=metric))
        results.append([metric, method, c])

# Convert the results to a DataFrame for easier view and analysis
cophenetic_df = pd.DataFrame(results, columns=['Distance Metric', 'Linkage Method', 'Cophenetic Correlation'])

# Find the combination with the highest cophenetic correlation
highest_correlation = cophenetic_df['Cophenetic Correlation'].max()
best_combination = cophenetic_df[cophenetic_df['Cophenetic Correlation'] == highest_correlation]
print(f"Highest Cophenetic Correlation: {highest_correlation:.4f}")
print(f"Best combination:\n{best_combination}")

# Display the cophenetic correlations for all combinations
print("\nCophenetic Correlations:")
cophenetic_df
Highest Cophenetic Correlation: 0.9590
Best combination:
  Distance Metric Linkage Method  Cophenetic Correlation
7       chebyshev        average                 0.95896

Cophenetic Correlations:
Out[22]:
Distance Metric Linkage Method Cophenetic Correlation
0 euclidean complete 0.595966
1 euclidean average 0.896379
2 euclidean single 0.843518
3 minkowski complete 0.595966
4 minkowski average 0.896379
5 minkowski single 0.843518
6 chebyshev complete 0.855965
7 chebyshev average 0.958960
8 chebyshev single 0.897534
Observations:¶
  • The best combination of distance metric and linkage is Chebyshev with average linkage.
  • The cophenetic correlation for this combination is 0.9590.
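The cophenetic correlation measures how faithfully the dendrogram's merge heights preserve the original pairwise distances. A minimal self-contained sketch of the same `cophenet` computation on synthetic data (not the project data), for one metric/linkage combination:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Two tight, well-separated synthetic groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# Cophenetic correlation: closer to 1 means the tree distorts distances less
Z = linkage(X, method="average", metric="chebyshev")
c, _ = cophenet(Z, pdist(X, metric="chebyshev"))
print(round(c, 3))
```

A clean two-group structure like this yields a high correlation, much as chebyshev/average did on the project data; it is a goodness-of-fit measure for the hierarchy, not a clustering quality score by itself.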

Question 11: Plot the dendrogram for every linkage method with Euclidean distance only. What should be the appropriate linkage according to the plot?¶

In [29]:
#hierarchical Clustering Dendrogram
#plt.figure(figsize=(10, 10))
plt.title("Dendrograms")

# Calculate linkage matrix for 'complete' linkage
linked = linkage(scaled_data, 'complete', metric='euclidean')
dendrogram(linked,
            orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()

#plt.figure(figsize=(10, 10))
plt.title("Dendrograms")
linked = linkage(scaled_data, 'average', metric='euclidean')
dendrogram(linked,
            orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()

#plt.figure(figsize=(10, 10))
plt.title("Dendrograms")
linked = linkage(scaled_data, 'single', metric='euclidean')
dendrogram(linked,
            orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()

linked = linkage(scaled_data, 'ward', metric='euclidean')
dendrogram(linked,
            orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)
plt.show()
[Dendrograms for complete, average, single, and ward linkage]
Observations:¶

  • Among all the linkage methods, ward produces the most balanced and clearly separated hierarchy.
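Once a linkage is chosen, the dendrogram can be cut into a flat clustering with scipy's `fcluster` (an alternative to re-fitting `AgglomerativeClustering`). A minimal sketch on synthetic data (not the project data):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated synthetic groups of 15 points each
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.4, size=(15, 2)),
               rng.normal(6, 0.4, size=(15, 2))])

Z = linkage(X, method="ward", metric="euclidean")
# Cut the ward dendrogram so that exactly 2 flat clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))  # → [1, 2]
```

With `criterion="distance"` instead, the tree can be cut at a specific merge height read off the dendrogram, which mirrors the visual "draw a horizontal line" reasoning.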

Question 12: Check the silhouette score for the hierarchical clustering. What should be the appropriate number of clusters according to this plot?¶

In [18]:
# Calculate Silhouette Score for Hierarchical Clustering
range_n_clusters = range(2,14)
silhouette_scores = []

for n_clusters in range_n_clusters:
    hierarchical_cluster = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='ward')
    cluster_labels = hierarchical_cluster.fit_predict(scaled_data)
    silhouette_avg = silhouette_score(scaled_data, cluster_labels)
    silhouette_scores.append(silhouette_avg)

plt.plot(range_n_clusters, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs. Number of Clusters")
plt.show()

best_n_clusters_hierarchical = range_n_clusters[np.argmax(silhouette_scores)]
print(f"Best number of clusters (Hierarchical) based on silhouette score: {best_n_clusters_hierarchical}")
[Silhouette score vs. number of clusters plot]
Best number of clusters (Hierarchical) based on silhouette score: 2
In [16]:
print(silhouette_scores)
[0.21123833296375513, 0.19192664391999967, 0.20176352122354696, 0.21070970107328682, 0.20995015294045005, 0.18464592084365147, 0.09352798652840015]
Observations:¶
  • According to the plot, the number of clusters with the maximum silhouette score is 2.
  • This is consistent with what we found previously in the K-Means analysis.

Question 13: Fit the hierarchical clustering model with the appropriate parameters finalized above. How much time does it take to fit the model?¶

In [16]:
import time

start_time = time.time()
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
cluster_labels = hierarchical_cluster.fit_predict(scaled_data)
end_time = time.time()
time_2clusters_hr = end_time - start_time

print(f"Hierarchical clustering fitting time: {time_2clusters_hr:.4f} seconds")
Hierarchical clustering fitting time: 0.3568 seconds
Observations:¶

  • Fitting the hierarchical model took 0.3568 seconds, roughly 14 times slower than K-Means (0.0254 seconds).

Cluster Profiling and Comparison¶

K-Means Clustering vs Hierarchical Clustering Comparison¶

Question 14: Perform and compare cluster profiling for both algorithms using boxplots. Based on all the observations, which one of them provides better clustering?¶

In [17]:
# K-Means Clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(scaled_data)

kmeans_labels = kmeans.labels_

# Hierarchical Clustering
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
hierarchical_labels = hierarchical_cluster.fit_predict(scaled_data)

#add cluster labels to the original dataframe
data['KMeans_Cluster'] = kmeans_labels
data['Hierarchical_Cluster'] = hierarchical_labels


# Cluster Profiling using boxplots
numerical_cols = data.select_dtypes(include=np.number).columns
for col in numerical_cols:
    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    sns.boxplot(data=data, x='KMeans_Cluster', y=col)
    plt.title(f'KMeans Clustering - {col}')

    plt.subplot(1, 2, 2)
    sns.boxplot(data=data, x='Hierarchical_Cluster', y=col)
    plt.title(f'Hierarchical Clustering - {col}')

    plt.tight_layout()  # Adjust layout to prevent overlapping titles
    plt.show()
[Side-by-side boxplots of each numerical column for the K-Means and hierarchical clusters]
Observations:¶
  • K-Means appears to separate the two clusters with fewer outliers than the hierarchical algorithm.

  • We can therefore conclude that, in this case, K-Means clustering is both faster and better-performing than hierarchical clustering.

  • For the next question we will use the K-Means labels.
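The visual comparison can be complemented with a quantitative agreement measure between the two labelings. A minimal sketch on synthetic blobs (not the project data) using scikit-learn's adjusted Rand index:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two obvious groups; both algorithms should recover the same partition
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                  cluster_std=0.7, random_state=1)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# 1.0 = identical partitions (up to label renaming); ~0 = chance-level agreement
ari = adjusted_rand_score(km_labels, hc_labels)
print(ari)  # → 1.0
```

Computing `adjusted_rand_score(kmeans_labels, hierarchical_labels)` on the project data would quantify how much the two segmentations actually disagree, since ARI is insensitive to which label each cluster happens to receive.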

Question 15: Perform Cluster profiling on the data with the appropriate algorithm determined above using a barplot. What observations can be derived for each cluster from this plot?¶

In [34]:
# Cluster profiling: average of each numerical feature per KMeans cluster
for col in numerical_cols:
    plt.figure(figsize=(10, 6))
    data.groupby('KMeans_Cluster')[col].mean().plot(kind='bar')
    plt.title(f'KMeans Clustering - Average {col} per Cluster')
    plt.xlabel('Cluster')
    plt.ylabel(f'Average {col}')
    plt.show()
[Barplots: average of each numerical feature per KMeans cluster]
In [23]:
alpha=data.groupby('KMeans_Cluster').Income.mean()
alpha
Out[23]:
KMeans_Cluster
0    72217.750296
1    39300.960281
Name: Income, dtype: float64

Observations:¶
  • The KMeans clustering yields 2 clusters of similar size.
  • The first cluster has a higher average income than the second one (about 72K vs 39K).
  • Cluster 1 mostly has no kids at home; its average is not exactly 0 only because a few customers in this cluster do have kids at home.
  • For purchases of foods and gold products, the first cluster spends more on average than the second one.
  • The first cluster also spends more on average both in store and on the company's website. Its customers tend to purchase through catalogues but rarely use discounts.
  • The second cluster responds less to the company's campaigns than the first one, and complaints are more likely to come from the second cluster.
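Beyond inspecting the profile plots, the quality of the two-cluster split can be quantified with a silhouette score. The sketch below uses synthetic data as a stand-in for the scaled features (the data and the threshold for a "good" score are assumptions, not results from the notebook):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic two-group data standing in for the scaled customer features.
X, _ = make_blobs(n_samples=500, n_features=5, centers=2,
                  cluster_std=1.0, random_state=0)

# Fit KMeans with k=2 and score the resulting partition.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette score for k=2: {score:.2f}")
```

Scores close to 1 indicate well-separated clusters; on the notebook's real data this would give a single number to back up the visual comparison between KMeans and hierarchical labels.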

Business Recommendations¶


Here are seven actionable business recommendations based on the cluster profiling:


1. Focus on Retaining High-Value Customers (Cluster 3)¶

  • Offer Exclusive Loyalty Programs: Provide tailored loyalty benefits, early access to products, and exclusive discounts to maintain engagement and drive repeat purchases.
  • Upsell and Cross-Sell: Introduce premium products or bundles targeting their high spending patterns across product categories like wines, gold products, and meats.
  • Personalized Campaigns: Use their high response rate to create personalized campaigns highlighting products they prefer.

2. Activate Potential in Moderate-Spending Customers (Cluster 2)¶

  • Incentivize Higher Engagement: Offer targeted discounts or special offers to encourage increased spending and purchases across channels.
  • Educate About Products: Provide content (emails, guides, or social media) showcasing the value and uniqueness of products they don’t purchase frequently.
  • Improve Campaign Effectiveness: Refine campaign messaging based on their moderate response rate to increase acceptance.

3. Reengage Low-Value Customers (Cluster 1)¶

  • Win-Back Campaigns: Implement campaigns specifically aimed at bringing back inactive customers, such as offering steep discounts or limited-time offers.
  • Understand Barriers to Engagement: Conduct surveys or collect feedback to identify reasons for their low purchases and disengagement.
  • Promote Entry-Level Products: Introduce affordable or trial-sized products to ease them into higher spending.

4. Convert Browsers into Buyers (Cluster 0)¶

  • Optimize Website Experience: Since Cluster 0 has high website visits but low spending, improve website navigation, showcase popular products, and streamline the checkout process.
  • Targeted Digital Campaigns: Retarget these users with ads or emails featuring products they browsed but didn’t purchase.
  • Offer Online-Exclusive Discounts: Provide web-only discounts or promotions to convert visits into purchases.

5. Strengthen Digital and Multi-Channel Strategies¶

  • Seamless Omni-Channel Experience: Ensure a consistent shopping experience across all channels (web, catalog, and store) to encourage cross-channel engagement, especially for Clusters 2 and 3.
  • Digital Campaigns for All Clusters: Focus on targeted digital campaigns, particularly for Clusters 0 and 2, as they have moderate to high online engagement.

6. Develop Campaigns to Boost Responses¶

  • Use the insights from Clusters 2 and 3 (which show higher response rates) to refine campaign targeting and messaging. Emulate successful strategies used for Cluster 3 to increase responses across other segments.

7. Leverage Product-Specific Insights¶

  • Promote popular categories (e.g., wines, gold products) to high-value clusters, while running introductory campaigns for less-engaged clusters to familiarize them with premium products.

By focusing on these strategies, the company can enhance engagement, increase revenue, and strengthen customer loyalty across all clusters.
